library(ggplot2)
library(GGally)
library(anytime)
library(plyr)    # attach plyr before dplyr so dplyr's verbs are not masked
library(dplyr)
library(caret)
library(ROCR)
library(scales)
library(randomForest)
library(gridExtra)
library(sqldf)
library(plotly)
unit.data = read.csv('Unit_Level.csv')
lot.data = read.csv('Lot_Level.csv')
summary(unit.data)
## UNIT_ID UNIT_PROCESS_DATE PARAMETER1
## 001H5KA42Xk5: 1 01/24/2018 05:29:18 AM: 4552 Min. :0.210
## 002rDzPJZV2c: 1 01/24/2018 09:44:42 AM: 4532 1st Qu.:2.290
## 002SIV9dq2dw: 1 01/24/2018 11:18:32 PM: 3420 Median :2.790
## 004jlyVK8xcR: 1 01/27/2018 02:13:01 PM: 3417 Mean :2.855
## 004JohgXsn8M: 1 01/27/2018 04:35:09 PM: 3411 3rd Qu.:3.390
## 007Mn160JQuT: 1 01/28/2018 07:48:32 AM: 3402 Max. :6.920
## (Other) :276957 (Other) :254229 NA's :6572
## PARAMETER2 PARAMETER3 PARAMETER4 PARAMETER5
## Min. :-10.13 Min. :-1.680 Min. :-554.420 Min. :-573.610
## 1st Qu.: -6.75 1st Qu.:-0.012 1st Qu.: -8.470 1st Qu.: -5.710
## Median : -5.80 Median : 0.000 Median : -1.350 Median : 0.620
## Mean : -5.75 Mean : 0.000 Mean : -0.006 Mean : 0.009
## 3rd Qu.: -4.78 3rd Qu.: 0.012 3rd Qu.: 6.940 3rd Qu.: 5.975
## Max. : 26.09 Max. : 1.716 Max. : 635.510 Max. : 361.370
## NA's :6572 NA's :6572 NA's :6572 NA's :6572
## PARAMETER6 PARAMETER7 PARAMETER8 PARAMETER9
## Min. : 623.0 Min. :-112.2 Min. :-0.091 Min. :-600.9
## 1st Qu.: 889.0 1st Qu.: 608.0 1st Qu.:-0.023 1st Qu.:-387.5
## Median : 933.0 Median : 696.7 Median :-0.020 Median :-340.2
## Mean : 934.2 Mean : 693.3 Mean :-0.021 Mean :-335.6
## 3rd Qu.: 976.0 3rd Qu.: 780.4 3rd Qu.:-0.017 3rd Qu.:-288.2
## Max. :3494.0 Max. :2576.0 Max. : 0.000 Max. : 799.2
## NA's :6572 NA's :6572 NA's :6572 NA's :6572
## PARAMETER10 PARAMETER11 PARAMETER12 PARAMETER13
## Min. :-1862.06 Min. :-544.82 Min. : 0.00 Min. : 17.90
## 1st Qu.: -613.40 1st Qu.:-276.11 1st Qu.: 10.34 1st Qu.: 67.16
## Median : -550.96 Median :-247.96 Median : 17.25 Median : 87.58
## Mean : -567.03 Mean :-246.36 Mean : 19.44 Mean : 90.66
## 3rd Qu.: -506.06 3rd Qu.:-218.71 3rd Qu.: 25.41 3rd Qu.:110.87
## Max. : -35.68 Max. : 51.14 Max. :494.32 Max. :524.33
## NA's :6572 NA's :6572 NA's :6572 NA's :6572
## UNIT_CARRIER_POS_X UNIT_CARRIER_POS_Y RESPONSE_FLAG
## Min. :0 Min. :0.0 Min. :0.00000
## 1st Qu.:1 1st Qu.:0.0 1st Qu.:0.00000
## Median :2 Median :1.0 Median :0.00000
## Mean :2 Mean :0.5 Mean :0.00682
## 3rd Qu.:3 3rd Qu.:1.0 3rd Qu.:0.00000
## Max. :4 Max. :1.0 Max. :1.00000
## NA's :5236 NA's :5236
summary(lot.data)
## UNIT_ID LOT_ID MATERIAL1_SUPPLIER
## 001H5KA42Xk5: 1 X_04D213M804: 4552 : 98128
## 002rDzPJZV2c: 1 X_04D284M804: 4532 MFG1: 30293
## 002SIV9dq2dw: 1 X_04D486M804: 3420 MFG2:148542
## 004jlyVK8xcR: 1 X_04E339M804: 3417
## 004JohgXsn8M: 1 X_04E374M804: 3411
## 007Mn160JQuT: 1 X_05C286M805: 3402
## (Other) :276957 (Other) :254229
## MATERIAL1_SUPPLIER_LOT_ID MATERIAL2_SUPPLIER MATERIAL2_SUPPLIER_FACILITY
## :126596 : 5236 . : 5236
## F21ZZ2364A: 790 Tech1: 4365 SE : 84621
## F21ZZ2364S: 9948 Tech2: 89997 SW : 5376
## F21ZZ2364Y: 1102 Tech3:177365 Tech1: 4365
## F21ZZ2365U: 13372 Tech3:177365
## MIXED :125155
##
## MATERIAL3_SUPPLIER MATERIAL3_SUPPLIER_LOT_ID FAJ_TOOL_ID
## : 70754 :70754 : 70754
## Tech1:206209 JZ7745.V45:48373 FAJ011: 1770
## JZ7743.V52:45317 FAJ015: 490
## JZ7767.V21:41985 FAJ211:198175
## JZ7746.V35:18250 FAJ213: 4580
## JZ7743.V51:13742 MIXED : 1194
## (Other) :38542
## VD_TOOL_ID VD_TOOL_LANE NX_TOOL_ID NX_TOOL_COMPARTMENT
## VDZ001 :108559 . : 5236 NXG1911:159543 : 6572
## VDZ004 : 63357 BACK :135945 NXG1933: 93017 BOTTOM:127300
## VDZ006 : 53337 FRONT:135782 NXG1903: 21636 TOP :143091
## : 39842 NXG1901: 2249
## MIXED : 9719 TPS004 : 406
## VDZ002 : 1119 TPS005 : 101
## (Other): 1030 (Other): 11
sapply(unit.data, function(col) sum(is.na(col))) %>% round(digits=2)
## UNIT_ID UNIT_PROCESS_DATE PARAMETER1
## 0 0 6572
## PARAMETER2 PARAMETER3 PARAMETER4
## 6572 6572 6572
## PARAMETER5 PARAMETER6 PARAMETER7
## 6572 6572 6572
## PARAMETER8 PARAMETER9 PARAMETER10
## 6572 6572 6572
## PARAMETER11 PARAMETER12 PARAMETER13
## 6572 6572 6572
## UNIT_CARRIER_POS_X UNIT_CARRIER_POS_Y RESPONSE_FLAG
## 5236 5236 0
We can observe that there are 6572 missing values in every parameter, and all 6572 of those rows have RESPONSE_FLAG = 0. We have a class imbalance, with just 0.68% of the data in class 1, and removing these 6572 class-0 rows won't worsen it. Hence, removing these rows won't impact our analysis.
unit.data = na.omit(unit.data)
sapply(unit.data, function(col) sum(is.na(col))) %>% round(digits=2)
## UNIT_ID UNIT_PROCESS_DATE PARAMETER1
## 0 0 0
## PARAMETER2 PARAMETER3 PARAMETER4
## 0 0 0
## PARAMETER5 PARAMETER6 PARAMETER7
## 0 0 0
## PARAMETER8 PARAMETER9 PARAMETER10
## 0 0 0
## PARAMETER11 PARAMETER12 PARAMETER13
## 0 0 0
## UNIT_CARRIER_POS_X UNIT_CARRIER_POS_Y RESPONSE_FLAG
## 0 0 0
Now, there are no missing values in the data.
Let us plot density plots to find out if there is skewness in the data.
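The density plot code is not shown above; as a minimal sketch using the packages already attached, the densities for two of the parameters could be drawn as follows (the same pattern extends to all thirteen).
d1 = ggplot(unit.data, aes(x = PARAMETER1)) + geom_density()    # roughly normal
d10 = ggplot(unit.data, aes(x = PARAMETER10)) + geom_density()  # left skewed
grid.arrange(d1, d10, ncol = 2)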
All the parameters seem roughly normal, except Parameter 10, which is left skewed, and Parameter 12, which is right skewed. These two parameters contain negative and zero values, so we won't be able to apply log or root transformations.
Let us draw boxplots to check whether there are many outliers in the data.
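Similarly, a minimal sketch of the boxplots for two example parameters; the remaining parameters follow the same pattern.
b1 = ggplot(unit.data, aes(x = "", y = PARAMETER1)) + geom_boxplot() + xlab(NULL)
b7 = ggplot(unit.data, aes(x = "", y = PARAMETER7)) + geom_boxplot() + xlab(NULL)
grid.arrange(b1, b7, ncol = 2)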
We can observe a lot of outliers in all of these parameters. Since the objective of our case study is failure analysis, outliers might play a very important role in detecting faults, so we cannot remove them. We could winsorize them to the 5th and 99th percentiles, but that might affect our predictions too. Therefore, we need to use methods that are robust to outliers to find the parameters that best explain failure.
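For reference, a minimal sketch of the winsorizing alternative mentioned above; the helper name is ours, and we do not apply it in this analysis.
# Clip a numeric vector to its 5th and 99th percentiles (winsorizing).
winsorize = function(x, lower = 0.05, upper = 0.99) {
  q = quantile(x, probs = c(lower, upper), na.rm = TRUE)
  pmin(pmax(x, q[1]), q[2])
}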
Looking at the summary of the unit data, we observe that the features are on very different scales, so we need to standardize them to bring all values into a comparable range. Z-normalization (zero mean, unit variance) is a standard way to do this: it rescales each feature while preserving its shape, including the outliers we decided to keep.
scaled.para = as.data.frame(scale(unit.data[3:15]))
summary(scaled.para)
## PARAMETER1 PARAMETER2 PARAMETER3
## Min. :-3.53426 Min. :-3.19849 Min. :-56.26032
## 1st Qu.:-0.75450 1st Qu.:-0.73029 1st Qu.: -0.40430
## Median :-0.08629 Median :-0.03656 Median : 0.00782
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.71556 3rd Qu.: 0.70828 3rd Qu.: 0.39921
## Max. : 5.43312 Max. :23.25075 Max. : 57.46839
## PARAMETER4 PARAMETER5 PARAMETER6
## Min. :-39.16179 Min. :-51.81561 Min. :-4.80485
## 1st Qu.: -0.59788 1st Qu.: -0.51658 1st Qu.:-0.69793
## Median : -0.09495 Median : 0.05522 Median :-0.01859
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.49062 3rd Qu.: 0.53894 3rd Qu.: 0.64531
## Max. : 44.89051 Max. : 32.64217 Max. :39.52211
## PARAMETER7 PARAMETER8 PARAMETER9
## Min. :-6.07041 Min. :-12.7462 Min. :-3.63140
## 1st Qu.:-0.64249 1st Qu.: -0.4786 1st Qu.:-0.71046
## Median : 0.02553 Median : 0.1520 Median :-0.06267
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.65648 3rd Qu.: 0.6649 3rd Qu.: 0.64946
## Max. :14.18888 Max. : 3.7858 Max. :15.53492
## PARAMETER10 PARAMETER11 PARAMETER12
## Min. :-16.1898 Min. :-6.98662 Min. :-1.6991
## 1st Qu.: -0.5796 1st Qu.:-0.69639 1st Qu.:-0.7959
## Median : 0.2010 Median :-0.03743 Median :-0.1916
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.7623 3rd Qu.: 0.64728 3rd Qu.: 0.5216
## Max. : 6.6427 Max. : 6.96420 Max. :41.4977
## PARAMETER13
## Min. :-2.27954
## 1st Qu.:-0.73627
## Median :-0.09653
## Mean : 0.00000
## 3rd Qu.: 0.63313
## Max. :13.58647
Let us merge the lot-level data and unit-level data on UNIT_ID.
We then calculate the defect percentage per lot and create three subsets: defect percentage greater than 0, greater than 1, and greater than or equal to 0 (i.e. all lots). A sketch of this step follows.
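The original code for this step is not shown; below is a minimal sketch in which the object names (joined.data, defect.rates, defects.gt0, defects.gt1) are our own. The dplyr verbs are namespaced in case plyr masks them.
joined.data = merge(unit.data, lot.data, by = "UNIT_ID")
joined.data$UNIT_PROCESS_DATE = anydate(joined.data$UNIT_PROCESS_DATE)
# Defect percentage per lot: share of units in the lot with RESPONSE_FLAG == 1.
defect.rates = joined.data %>%
  dplyr::group_by(LOT_ID) %>%
  dplyr::summarise(FIRST_DATE = min(UNIT_PROCESS_DATE),
                   DEFECT_PCT = 100 * mean(RESPONSE_FLAG))
defects.all = defect.rates                                  # defect percentage >= 0 (all lots)
defects.gt0 = dplyr::filter(defect.rates, DEFECT_PCT > 0)   # lots with any defect
defects.gt1 = dplyr::filter(defect.rates, DEFECT_PCT > 1)   # lots with > 1% defects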
We can zoom in and out of the graph to analyze different time frames by selecting an area on it. Every label shows the LOT_ID along with the timestamp at that point, indicating which LOT_ID failed at that point in time.
For a more detailed analysis, let us plot the defect percentages greater than 1 over time.
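As a sketch of the interactive plot described above (assuming the defects.gt1 frame from the previous sketch), plotly's hover text can carry the LOT_ID labels.
plot_ly(defects.gt1,
        x = ~FIRST_DATE, y = ~DEFECT_PCT,
        type = "scatter", mode = "lines+markers",
        text = ~paste("LOT_ID:", LOT_ID)) %>%
  layout(title = "Lots with defect percentage > 1 over time",
         xaxis = list(title = "Process date"),
         yaxis = list(title = "Defect %"))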
This graph shows that 17th January to 25th January is the most critical time, with a very high percentage of defects. We need to analyze this time frame more closely to find out which lots were defective.
This gives us more insight into which LOT_ID fails on which date. On this basis, we can analyze the tools and lot-level parameters in this time frame to find the causes of the high defect rate.
Let us create a validation set to test our model.
set.seed(1)  # fix the RNG seed so the split is reproducible (the seed value is arbitrary)
df1 = data.frame(scaled.para, RESPONSE_FLAG = unit.data$RESPONSE_FLAG)
test_idx = sample(1:nrow(df1), size = floor(nrow(df1)/4))  # hold out 25% as a validation set
test = df1[test_idx, ]
train = df1[-test_idx, ]
train$RESPONSE_FLAG = as.factor(train$RESPONSE_FLAG)
We have a binary response variable taking values 0 and 1, so we can fit a binomial logistic regression model.
glm.model = glm(RESPONSE_FLAG~PARAMETER1+PARAMETER2+PARAMETER3+PARAMETER4
+PARAMETER5+PARAMETER6+PARAMETER7+PARAMETER8+PARAMETER9+
PARAMETER10+PARAMETER11+PARAMETER12+PARAMETER13,data = train,
family=binomial(link='logit'))
summary(glm.model)
##
## Call:
## glm(formula = RESPONSE_FLAG ~ PARAMETER1 + PARAMETER2 + PARAMETER3 +
## PARAMETER4 + PARAMETER5 + PARAMETER6 + PARAMETER7 + PARAMETER8 +
## PARAMETER9 + PARAMETER10 + PARAMETER11 + PARAMETER12 + PARAMETER13,
## family = binomial(link = "logit"), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9189 -0.0891 -0.0564 -0.0336 4.9430
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.51371 0.05276 -123.454 < 2e-16 ***
## PARAMETER1 -0.25467 0.06496 -3.921 8.83e-05 ***
## PARAMETER2 -0.10555 0.04041 -2.612 0.0090 **
## PARAMETER3 -0.02268 0.02903 -0.781 0.4346
## PARAMETER4 0.20878 0.03407 6.128 8.90e-10 ***
## PARAMETER5 0.02706 0.03328 0.813 0.4162
## PARAMETER6 0.04705 0.03727 1.262 0.2069
## PARAMETER7 2.49552 0.04736 52.694 < 2e-16 ***
## PARAMETER8 -0.41794 0.03805 -10.983 < 2e-16 ***
## PARAMETER9 0.20135 0.05144 3.914 9.07e-05 ***
## PARAMETER10 -0.08609 0.03559 -2.419 0.0156 *
## PARAMETER11 -0.80809 0.04902 -16.484 < 2e-16 ***
## PARAMETER12 0.16785 0.02189 7.669 1.73e-14 ***
## PARAMETER13 -0.81089 0.03681 -22.028 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 16841.5 on 202793 degrees of freedom
## Residual deviance: 9484.1 on 202780 degrees of freedom
## AIC: 9512.1
##
## Number of Fisher Scoring iterations: 9
The GLM fit gives us very important information about the importance of the parameters. Some parameters have a very high p-value, which signifies that they are not statistically significant for predicting the response: PARAMETER3, PARAMETER5 and PARAMETER6 are the least important. Parameters 7, 8, 11 and 13 appear to be important; we will analyze this further below. Furthermore, we can use the coefficients to reason about the probability of failure. The response is 0 for no failure and 1 for a failure, so a positive coefficient (slope) for a predictor means that higher values increase the probability of failure. For example, PARAMETER7 has a coefficient of about 2.5 on the standardized scale, so a unit whose PARAMETER7 value is one standard deviation higher is much more likely to fail, while PARAMETER13's negative coefficient means that higher values make failure less likely.
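To make the coefficient interpretation concrete, here is a short sketch using the fitted glm.model; coef() and plogis() are base R.
b0 = coef(glm.model)["(Intercept)"]  # log-odds of failure with every predictor at its mean
b7 = coef(glm.model)["PARAMETER7"]   # log-odds change per 1 SD of PARAMETER7
exp(b7)             # odds ratio for a 1-SD increase in PARAMETER7 (~ e^2.5, about 12x the odds)
plogis(b0)          # failure probability at the mean of all predictors
plogis(b0 + 2 * b7) # failure probability with PARAMETER7 two SDs above its mean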
We can analyze the model further using the analysis-of-deviance table, which shows how much deviance is explained as each predictor is added.
anova(glm.model,test = 'Chisq')
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: RESPONSE_FLAG
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 202793 16841.5
## PARAMETER1 1 0.9 202792 16840.6 0.3490849
## PARAMETER2 1 82.8 202791 16757.8 < 2.2e-16 ***
## PARAMETER3 1 0.0 202790 16757.7 0.8660808
## PARAMETER4 1 0.5 202789 16757.2 0.4682398
## PARAMETER5 1 4.7 202788 16752.5 0.0304563 *
## PARAMETER6 1 17.6 202787 16734.9 2.689e-05 ***
## PARAMETER7 1 6189.3 202786 10545.6 < 2.2e-16 ***
## PARAMETER8 1 204.3 202785 10341.3 < 2.2e-16 ***
## PARAMETER9 1 1.7 202784 10339.7 0.1977043
## PARAMETER10 1 14.0 202783 10325.7 0.0001818 ***
## PARAMETER11 1 227.8 202782 10097.8 < 2.2e-16 ***
## PARAMETER12 1 76.0 202781 10021.8 < 2.2e-16 ***
## PARAMETER13 1 537.8 202780 9484.1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The difference between the null deviance and the residual deviance shows how our model is doing against the null model. Analyzing the residual deviance column, we can see a significant drop in deviance for Parameter 7, which signifies that it is the most important parameter. There are also large decreases in deviance for Parameters 8, 11 and 13, so these parameters seem to be the most important for detecting a failure.
We need to check whether these parameters are highly correlated with each other. This will help us exclude parameters that are highly correlated and therefore convey no extra information in our model.
ggpairs(unit.data,c('PARAMETER7','PARAMETER11','PARAMETER13','PARAMETER8'))
This tells us that Parameter 7 and Parameter 11 are highly correlated (corr = 0.715), so we will drop Parameter 11 from the best parameters. Thus, the best parameters found by this analysis are Parameters 7, 8 and 13.
Let us check whether the classes are balanced in our data.
summary(factor(unit.data$RESPONSE_FLAG))
## 0 1
## 268502 1889
We can observe a high class imbalance in the data (1889 failures out of 270391 units), so we cannot use accuracy as a performance measure. The ROC curve and the area under it are better measures for this study.
my.predictions = predict(glm.model,test,type = 'response')
pr <- prediction(my.predictions, test$RESPONSE_FLAG)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
The ROC curve indicates that our model is doing well: the true positive rate rises steeply at low false positive rates, and a curve above the diagonal (the random-classifier baseline) indicates a good model.
The area under the curve (AUC) is a good summary of performance. AUC varies between 0 and 1; an AUC of 1 indicates a model that predicts perfectly, with no false positives or false negatives.
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8507283
We have now identified the top 3 parameters: PARAMETER7, PARAMETER8 and PARAMETER13. Let us fit a GLM with only these and predict on the same validation set.
glm.model = glm(RESPONSE_FLAG~PARAMETER7+PARAMETER8+PARAMETER13,data = train,
family=binomial(link='logit'))
anova(glm.model,test = 'Chisq')
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: RESPONSE_FLAG
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 202793 16842
## PARAMETER7 1 6074.3 202792 10767 < 2.2e-16 ***
## PARAMETER8 1 265.9 202791 10501 < 2.2e-16 ***
## PARAMETER13 1 437.7 202790 10064 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
my.predictions = predict(glm.model,test,type = 'response')
pr <- prediction(my.predictions, test$RESPONSE_FLAG)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8521408
We can observe a slight increase in AUC (0.8507 to 0.8521) when we select just these three features. This confirms that the three parameters (Parameters 7, 8 and 13) predict failure at least as well as the full model.
Let us look at the scatter plots of different Parameters.
unscaled_joined.data = merge(unit.data,lot.data,by = "UNIT_ID")
unscaled_joined.data$UNIT_PROCESS_DATE = anydate(unscaled_joined.data$UNIT_PROCESS_DATE)
ggplot(unscaled_joined.data,aes(x = PARAMETER7, y = PARAMETER13, color =factor(RESPONSE_FLAG)))+
geom_point(alpha = 0.5)+xlab("PARAMETER7")+ ylab("PARAMETER13")+labs(color = "Did it fail?")+ scale_color_manual(values =c("#E69F00","#56B4E9"), labels =c("No", "Yes"))
We can observe that Parameter 13 values vary between 0 and about 400; similarly, Parameter 7 values vary between about 0 and 2500. It would be difficult to make predictions beyond these ranges. Higher values of Parameter 7 (above 1000) indicate a failure, so we can interpret that most failures occur where Parameter 13 is low and the corresponding Parameter 7 is high.
ggplot(unscaled_joined.data,aes(x = PARAMETER13, y = PARAMETER8, color =factor(RESPONSE_FLAG)))+
geom_point(alpha = 0.5)+xlab("PARAMETER13")+ ylab("PARAMETER8")+labs(color = "Did it fail?")+ scale_color_manual(values =c("#E69F00","#56B4E9"), labels =c("No", "Yes"))
Also, if we plot the Parameter 8 and Parameter 13 values, most of the data indicates no failure; we cannot draw conclusions from this plot alone.
ggplot(unscaled_joined.data,aes(x = PARAMETER8, y = PARAMETER7, color =factor(RESPONSE_FLAG)))+
geom_point(alpha = 0.5)+xlab("PARAMETER8")+ ylab("PARAMETER7")+labs(color = "Did it fail?")+ scale_color_manual(values =c("#E69F00","#56B4E9"), labels =c("No", "Yes"))
This plot shows that higher values of Parameter 7 have more failures, and that there is a narrow range of Parameter 8 values (around -0.025) for which failures occur.
Looking at the three scatter plots, we can say that Parameter 7 can largely predict the failures on its own. Therefore, to analyze the remaining parameters further, we will examine them with respect to increases in the Parameter 7 values.
Since there is a huge increase in defects from 19th to 25th January, let us analyze the data by dividing it into two subsets, before and after mid-January, as sketched below.
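Here is a minimal sketch of the split and of the boxplot template reused throughout the rest of this section; the exact cutoff date is an assumption.
before.mid.jan = subset(unscaled_joined.data, UNIT_PROCESS_DATE < as.Date("2018-01-15"))
after.mid.jan  = subset(unscaled_joined.data, UNIT_PROCESS_DATE >= as.Date("2018-01-15"))
# PARAMETER7 by FAJ_TOOL_ID, split by failure flag; swapping the x aesthetic
# for the supplier/tool columns gives the later plots in this section.
p1 = ggplot(before.mid.jan,
            aes(x = FAJ_TOOL_ID, y = PARAMETER7, fill = factor(RESPONSE_FLAG))) +
  geom_boxplot() + ggtitle("Before mid-January") + labs(fill = "Did it fail?")
p2 = ggplot(after.mid.jan,
            aes(x = FAJ_TOOL_ID, y = PARAMETER7, fill = factor(RESPONSE_FLAG))) +
  geom_boxplot() + ggtitle("After mid-January") + labs(fill = "Did it fail?")
grid.arrange(p1, p2, nrow = 2)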
This boxplot indicates that there is some issue with assembly line FAJ211. There were some failures before mid-January, but after mid-January the failures on this line increased drastically, and its Parameter 7 values are large. This tells us we need to analyze this assembly line to find where the issues occurred.
Now, let us check MATERIAL1_SUPPLIER with respect to this assembly line.
We can observe a high number of outliers in FAJ211 for MFG1 and MFG2 in mid-January. There are also blank values in the data, i.e. missing supplier information; if we had data about those material suppliers, we could draw more concrete conclusions about this parameter.
Now, let us check MATERIAL2_SUPPLIER.
In mid-January, there seem to be a lot of failures for the Tech2 material supplier. We can observe a large number of outliers, with extremely large Parameter 7 values indicating failures. This can be considered one of the main causes of the increased failure rate.
Let us analyze MATERIAL2_SUPPLIER with respect to RESPONSE_FLAG.
This confirms that the Tech2 supplier's products had some problem during this time period, which caused an increase in failure rates.
Let us check MATERIAL2_SUPPLIER_FACILITY.
This shows us that the FAJ211 TOOL_ID has some issues tied to the SE supplier facility in mid-January.
Let us analyze MATERIAL3_SUPPLIER.
This shows failures in mid-January for the FAJ211 TOOL_ID and MIXED lots from the Tech1 supplier. There could be some issue with this supplier during this period; the extremely high values for TOOL_ID FAJ211 indicate that this tool failed.
The boxplot shows that many failures occur across different tools on this assembly station. Since most of the tools show a large increase in failure rates, the problem is likely in the assembly station rather than in individual tools.
Let us check whether there is a problem on the front or back lane of the assembly station.
This shows that VD_TOOL_LANE is not related to the response variable: both boxplots indicate that Parameter 7 has comparable values, with almost the same number of outliers, on either lane.
Let us check MATERIAL1_SUPPLIER.
We cannot draw any conclusions about the Material 1 supplier: there is a lot of missing data, with extreme outliers.
Let us analyze MATERIAL2_SUPPLIER with respect to RESPONSE_FLAG.
Many TOOL_IDs in this assembly station seem to have a problem with the Tech2 Material 2 supplier during mid-January.
Let us check MATERIAL2_SUPPLIER_FACILITY.
This shows that there is some issue with the SE facility of the Material 2 supplier, which caused problems for the FAJ_TOOL_ID assembly station too.
Let us analyze MATERIAL3_SUPPLIER.
This also shows a lot of failures for the Tech1 Material 3 supplier; thus, it is one of the main root causes of the increase in defects.
Let us check the NX_TOOL_ID assembly station.
This shows us that the NXG1911 and NXG1933 tool IDs have failures, and some failures are also observed for the TPS004 TOOL_ID in mid-January.
Let us check whether these failures occur in a particular compartment of the assembly station, using the faceted variant of the boxplot template below.
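A sketch of the faceted variant, reusing after.mid.jan from the earlier sketch.
ggplot(after.mid.jan,
       aes(x = NX_TOOL_ID, y = PARAMETER7, fill = factor(RESPONSE_FLAG))) +
  geom_boxplot() +
  facet_wrap(~ NX_TOOL_COMPARTMENT) +
  labs(fill = "Did it fail?")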
Looking at the NXG1911 and NXG1933 tool IDs, we can observe a discrepancy in the bottom compartment of NXG1911; this is one of the root causes of the increase in defects in mid-January. NXG1933 also shows some outliers in its bottom compartment, but we cannot make a strong argument about that tool ID.
Let us check MATERIAL1_SUPPLIER.
This boxplot tells us that NXG1911 has a lot of failures for units with missing Material 1 supplier information (in mid-January).
Let us analyze MATERIAL2_SUPPLIER with respect to RESPONSE_FLAG.
This indicates that the increase in failures is due to the Tech2 Material 2 supplier. Since all three assembly lines failed on material from this supplier, it is a major cause of failures.
Let us check MATERIAL2_SUPPLIER_FACILITY.
This indicates that there is a problem in the SE facility of the Material 2 supplier.
Let us analyze MATERIAL3_SUPPLIER.
This indicates that the Tech1 Material 3 supplier has some issues: all lots from this supplier have problems, and this is causing failures.
Detection of failure at an early stage is vital. Therefore, we need to gather data starting from the raw materials through to the quality of the final product and customer satisfaction, along with the failure points and the data corresponding to them. For instance, we need data about the raw materials, which can help us predict their quality. A machine whose useful life has diminished is likely to cause more failures at the assembly station. Moreover, we need information about the suppliers (as we have in this case study); it helps us detect whether data from certain suppliers is consistently faulty.
To keep a check on raw materials and suppliers, we can also collect external ratings of the suppliers. This gives an idea of their market reputation, which is crucial in failure analysis.
From a quality control perspective, control charts play a very important role in detecting failures. They display the limits of statistical variability that can be explained as normal: if our parameter of interest (like Parameter 7 in this case study) stays within these limits, the process is said to be in control, and a problem can be caught at one particular level before it propagates. A sketch follows.
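As a minimal sketch, Shewhart-style 3-sigma limits for Parameter 7 could be drawn as follows; in practice the limits would be estimated from an in-control reference period rather than from all the data.
center = mean(unit.data$PARAMETER7)
sigma  = sd(unit.data$PARAMETER7)
ucl = center + 3 * sigma  # upper control limit
lcl = center - 3 * sigma  # lower control limit
ggplot(unit.data, aes(x = seq_along(PARAMETER7), y = PARAMETER7)) +
  geom_line(alpha = 0.3) +
  geom_hline(yintercept = center) +
  geom_hline(yintercept = c(lcl, ucl), linetype = "dashed") +
  labs(x = "Unit (process order)", y = "PARAMETER7",
       title = "Points outside the dashed limits are out of control")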